Abstract



The problem of optimizing unknown costly-to-evaluate functions has been studied for a long time in the context of 
Bayesian Optimization. Algorithms in this field aim to find the optimizer of the function by asking only a few function 
evaluations at locations carefully selected based on a posterior model. In this paper, we assume the unknown function 
is Lipschitz continuous. Leveraging the Lipschitz property, we propose an algorithm with a distinct exploration phase 
followed by an exploitation phase. The exploration phase aims to select samples that shrink the search space as 
much as possible. The exploitation phase then focuses on the reduced search space and selects samples closest to 
the optimizer. Considering the Expected Improvement (EI) as a baseline, we empirically show that the proposed 
algorithm significantly outperforms EI. 

1 Introduction 

In many applications, such as nanotechnology and finance, we want to optimize an unknown function $f(\cdot)$ that is
costly to evaluate over a compact input space. Standard optimization methods cannot be applied to this type of problem
since they either need to evaluate the function frequently or require the gradient of the function at each
point. In contrast, Bayesian optimization algorithms try to solve this problem with a small number of function
evaluations.

Bayesian optimization algorithms have two key components: 1) A posterior model to predict the output value of the 
function at any arbitrary input point, and 2) A selection criterion to determine which point is going to be evaluated next. 
The first step of a Bayesian optimization algorithm is to generate a posterior probabilistic model over unobserved points 
of the function. A Gaussian process (GP) [16] is used in most of the recent Bayesian Optimization literature
as the probabilistic posterior model. In general, GP models the function output for any unobserved point in the input 
space as a normal random variable, whose mean and variance depend on the location of the point in relation to a set of 
observed samples. Based on the generated posterior model, a selection criterion is then used to choose the next sample 
to be evaluated. A number of selection criteria have been proposed in the Bayesian Optimization literature. They
typically work by selecting a sample that optimizes some objective function designed to balance exploring
unobserved areas against exploiting the promising observed parts of the input space. Maximum probability of improvement
[8, 20] and maximum expected improvement (EI) [13] are two successful examples.

In this paper, we focus on the design of the selection criterion for Bayesian Optimization (BO). In particular, we 
study Bayesian Optimization in a sequential setting [11, 18, 19], where the samples are chosen sequentially and a
selection is made only after the function evaluations of the previous samples are revealed. We make a mild assumption 
that the unknown function is Lipschitz-continuous. Leveraging the Lipschitz property, we design a selection algorithm 
that operates in two distinct phases: the exploration phase and the exploitation phase. 

The exploration phase of the proposed algorithm, at each step, selects a sample that eliminates the largest possible
portion of the input space while guaranteeing with high probability that the eliminated part does not include the
maximizer of the function. Hence, the exploration stage of the algorithm tries to shrink the search space of the function
as much as possible. In contrast, the exploitation phase of our algorithm selects the point which is believed to be the 
closest sample to the optimal point with high probability. Experimental results over 8 real and synthetic benchmarks 
indicate that the proposed approach is able to outperform the Expected Improvement (EI) criterion, one of the current 
state-of-the-art BO selection methods. In particular, we show that our algorithm is better than EI both in terms of the 
mean and variance of the performance. 

We also investigate whether combining our exploration stage with EI can boost the performance of EI. The
results are negative: sometimes it helps and sometimes it hurts, and on average we observe little to no improvement
over EI. This is possibly because our exploration method actively aims to eliminate regions from the input space and the
EI criterion does not take that into consideration when selecting samples.

The remainder of the paper is organized as follows. In Section 2, we motivate the use of exploration-exploitation 
Bayesian Optimization by analyzing the behavior of EI. Section 3 introduces our algorithm and provides insights into 
both theoretical and practical aspects of the algorithm. Experimental evaluation of our algorithm is shown in Section 
4. Finally, the paper is concluded in Section 5. 



2 Motivating Observation 

In this section, we motivate our approach by revealing a key observation about the well-known Expected Improvement
(EI) algorithm. Using a Gaussian Process (GP) [16] as the posterior model of the unknown function, the EI objective
is defined as

$$\mathrm{EI}(x \mid O) = (\mu_{x|O} - y_{\max})\, \Phi\!\left(\frac{\mu_{x|O} - y_{\max}}{\sigma_{x|O}}\right) + \sigma_{x|O}\, \phi\!\left(\frac{\mu_{x|O} - y_{\max}}{\sigma_{x|O}}\right), \qquad (1)$$

where $\mu_{x|O}$ and $\sigma_{x|O}$ are the mean and standard deviation associated with the point $x$ by the GP, and $\Phi(\cdot)$ and $\phi(\cdot)$ are
the standard Gaussian CDF and PDF, respectively. Here, $O = \{(x_i, f(x_i))\}_{i=1}^{n}$ is the set of $n$ observed samples $x_O$ with
their function evaluations $f(x_O)$, and we define $y_{\max} = \max_{x_i \in x_O} f(x_i)$. Further, the means and variances are defined as
follows:

$$\mu_{x|O} = k(x, x_O)\, k(x_O, x_O)^{-1} f(x_O),$$

$$\sigma_{x|O}^2 = k(x, x) - k(x, x_O)\, k(x_O, x_O)^{-1} k(x_O, x),$$

where $k(\cdot, \cdot)$ is some kernel function. In this paper, we consider the Gaussian kernel $k(x_1, x_2) = \exp\left(-\frac{1}{l}\|x_1 - x_2\|_2^2\right)$.
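The posterior and EI formulas of this section can be sketched numerically as follows. This is a minimal illustration, not the authors' implementation: it assumes a zero-mean GP, a Gaussian kernel of the form exp(-||x1 - x2||^2 / width), and a small `jitter` term for numerical stability; the function names and the `jitter` parameter are hypothetical.

```python
import math
import numpy as np

def gaussian_kernel(A, B, width):
    """k(a, b) = exp(-||a - b||^2 / width) for every row a of A and b of B."""
    sq_dist = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq_dist / width)

def gp_posterior(X_query, X_obs, y_obs, width, jitter=1e-10):
    """Posterior mean and standard deviation of a zero-mean GP at X_query."""
    # jitter is an assumption of this sketch, added only for numerical stability.
    K = gaussian_kernel(X_obs, X_obs, width) + jitter * np.eye(len(X_obs))
    k_q = gaussian_kernel(X_query, X_obs, width)            # shape (m, n)
    mean = k_q @ np.linalg.solve(K, y_obs)
    # k(x, x) = 1 for this kernel, so the prior variance is 1.
    var = 1.0 - np.einsum("ij,ji->i", k_q, np.linalg.solve(K, k_q.T))
    return mean, np.sqrt(np.clip(var, 0.0, None))

def expected_improvement(mean, std, y_max):
    """Closed-form EI; falls back to max(mean - y_max, 0) where std == 0."""
    ei = np.maximum(mean - y_max, 0.0)
    pos = std > 0
    z = (mean[pos] - y_max) / std[pos]
    erf = np.vectorize(math.erf, otypes=[float])
    cdf = 0.5 * (1.0 + erf(z / math.sqrt(2.0)))
    pdf = np.exp(-0.5 * z ** 2) / math.sqrt(2.0 * math.pi)
    ei[pos] = (mean[pos] - y_max) * cdf + std[pos] * pdf
    return np.maximum(ei, 0.0)   # guard against round-off below zero
```

A selection step would evaluate `expected_improvement` on a set of candidate points and pick the candidate with the largest value.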

EI has been widely used and studied; however, there has always been a concern about how it balances exploration and
exploitation. The main reason for this concern is that, even though the asymptotic convergence of EI is guaranteed
under certain conditions [21], EI can become trapped in a local optimum if it does not explore well and only a limited
number of samples is available. There have been some attempts in the literature to address this concern with varying degrees of
success, which we briefly discuss here.

(a) The original form of EI is defined as

$$\mathrm{EI}(x) = \mathbb{E}\left[(f(x) - y_{\max})\, \mathbb{1}_{\{f(x) - y_{\max} > 0\}}\right],$$

where $\mathbb{1}_{\{\cdot\}}$ is the indicator function. Hence, it measures the expected improvement of the choice of $x$ over
the current maximum function evaluation $y_{\max}$ over the observed samples. Researchers have proposed to replace
$y_{\max}$ with a smaller value to make EI more exploitative and with a larger value to make it more explorative. In
particular, Lizotte [12] suggested $y_{\max} + \xi$ and Azimi et al. [2] suggested $(1 + \xi)\, y_{\max}$ to replace $y_{\max}$. However,
this approach has not seen much empirical success. Lizotte [12] showed that starting with large values of $\xi$ (to
be explorative in the beginning) and cooling it down (to become more and more exploitative) makes little or no
difference in the performance of EI.

(b) In a separate line of work, Schonlau [18] proposed to consider a surrogate function

$$\mathrm{EI}^{\xi}(x) = \mathbb{E}\left[(f(x) - y_{\max})^{\xi}\, \mathbb{1}_{\{f(x) - y_{\max} > 0\}}\right].$$

For $\xi = 1$, this objective tries to improve over $y_{\max}$ (exploitation mode), and as we decrease $\xi$ it starts to explore
uncertain areas (exploration mode). This method is very sensitive to small changes in $\xi$ and, except for very
specific setups like the one used in [17], there is no systematic way to choose $\xi$. This makes it nearly impossible
to use this method in practice.

(c) The third proposal is to have a "random" exploration phase followed by EI. In this approach, we take a number
of random samples before switching to EI. We analyze this method in Fig. 1. For a fixed budget $n_b$, we run $n_b$
experiments as follows: first we consider the case where there is 1 random sample followed by $n_b - 1$ samples
selected by the EI criterion, next we consider the case where there are 2 random samples followed by $n_b - 2$ EI
samples, and so on. The purpose of this investigation is to understand whether exploring with random samples
prior to selecting with EI can improve the performance of EI, and if so, how much exploring is necessary. We
run these experiments on a number of different functions introduced in Section 4. These experiments reveal that
"random" exploration never helps EI, since the regret monotonically increases as we increase the number of
random samples from 1 to $n_b$.

Figure 1: Plot of regret versus the number of random exploration samples for the EI algorithm. For a fixed budget $n_b$, we run a
number of experiments as follows: first we consider the case where there is 1 random sample followed by $n_b - 1$ EI
samples, next we consider the case where there are 2 random samples followed by $n_b - 2$ EI samples, and so on.
For 2D and 3D functions, we let $n_b = 15$ and for high-dimensional functions, we let $n_b = 35$. This result shows that
the best EI performance is achieved when we do not do random exploration.

Algorithm 1 Next Best exploRative Sample (NBRS)

Input: Maximum $M$, Lipschitz constant $L$ and set of observed samples $\{(x_1, f(x_1)), \ldots, (x_t, f(x_t))\}$
Output: Next best explorative sample $x$

$x \leftarrow \arg\max_{x} \mathrm{Vol}\left(D_t \cap S\left(x, \frac{|M - \mu_{x|O}| - 1.5\,\sigma_{x|O}}{L}\right)\right)$

Based on the existing literature as well as our empirical investigation of EI discussed above, we would like to know 
whether or not it is possible to design an algorithm that operates in two naturally defined phases of exploration and 
exploitation and achieves consistently better performance than EI. We devote the next section to answer this question 
and introduce our proposed algorithm. 

3 Finite Horizon Bayesian Optimization 

Because it cannot balance exploration and exploitation well, EI might perform poorly, especially when the query
budget is small. In this section, we propose a two-phase exploration/exploitation algorithm that outperforms EI in our
experiments by treating exploration and exploitation as separate, explicitly designed phases.



3.1 Exploration 

Generally, a good exploration algorithm should be able to shrink the search space, so that we are left with a small
region to focus on during the exploitation stage. Let $D = \prod_{i=1}^{d} [a_i, b_i] \subset \mathbb{R}^d$ be the Cartesian product of intervals $[a_i, b_i]$ for
some $a_i < b_i$ and $i \in \{1, 2, \ldots, d\}$. Suppose the unknown function $f : D \mapsto [m, M]$ (with $f(x^*) = M$) is a Lipschitz
function over $D$ with constant $L$, that is, for all $x_1, x_2 \in D$, we have

$$|f(x_1) - f(x_2)| \le L\, \|x_1 - x_2\|_2.$$

Notice that if the function is not Lipschitz, then there is no hope that we can find the global optimum of $f(\cdot)$ even
with countably infinitely many evaluations. Thus, the Lipschitz continuity assumption is not a strong assumption. Moreover,
functions with larger $L$ are harder to optimize since they change more abruptly over the space.

For any point $x \in D$, let $r_x = \frac{M - f(x)}{L}$ be the radius associated with the point $x$. By the Lipschitz continuity assumption,
we know that $x^* \notin S(x, r_x)$, where $S(x, r_x)$ is the set of all points inside the sphere (or circle) with radius $r_x$ centered
at $x$ (and the single point $x$ if $r_x < 0$); otherwise, the Lipschitz assumption is violated. Indeed, any $x'$ with
$\|x' - x\|_2 < r_x$ satisfies $f(x') \le f(x) + L\,\|x' - x\|_2 < M$. This means that if we have a sample at
point $x$, then we do not need any more samples inside $S(x, r_x)$.
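The elimination rule above can be illustrated with a short sketch. It assumes the radius has the form (M - f(x)) / L and treats the eliminated region as the open ball around each sample; the function names are hypothetical.

```python
import numpy as np

def elimination_radius(f_x, M, L):
    """r_x = (M - f(x)) / L; a non-positive value means nothing is eliminated."""
    return (M - f_x) / L

def is_eliminated(candidates, X_obs, radii):
    """True for each candidate lying strictly inside some ball S(x_i, r_i)."""
    dist = np.linalg.norm(candidates[:, None, :] - X_obs[None, :, :], axis=-1)
    return (dist < radii[None, :]).any(axis=1)
```

For example, take f(x) = 1 - |x - 0.75| on [0, 1], so that M = 1, L = 1 and the maximizer is 0.75. A sample at x = 0.25 has f(x) = 0.5 and radius 0.5: every point closer than 0.5 to the sample is removed, while the maximizer, at distance exactly 0.5, is kept.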

The expected value of $r_x$ satisfies $\mathbb{E}[r_x] = \frac{|M - \mu_{x|O}|}{L}$. Since $f(x)$ is a normal random variable $\mathcal{N}(\mu_{x|O}, \sigma_{x|O}^2)$, using the
Hoeffding inequality, for all $\epsilon > 0$ we have

$$\mathbb{P}\left[ r_x < \frac{|M - \mu_{x|O}| - \epsilon}{L} \right] \le \exp\left(-\frac{2\epsilon^2}{\sigma_{x|O}^2}\right).$$

Replacing $\epsilon$ with $1.5\,\sigma_{x|O}$, the above inequality entails that with high probability (99%), $r_x \ge \frac{|M - \mu_{x|O}| - 1.5\,\sigma_{x|O}}{L}$. Hence,
a "good" algorithm for exploration should try to find the $x$ that maximizes the lower bound on $r_x$. This choice of $x$ will
remove a large volume of points from the search space. Note, however, that if $x$ is close to the boundaries of $D$, then
most of the volume of the sphere might lie outside $D$. Also, the sphere associated with $x$ might
have significant overlap with the spheres of other points that are already selected. To fix this issue, we pick the point
whose sphere has the largest intersection with the unexplored search space in terms of its volume. The pseudo code of this
method is described in Algorithm 1, which we refer to as the Next Best exploRative Sample (NBRS) algorithm. NBRS
achieves the optimal exploration in the sense that it maximizes the expected explored volume.

The value of $|M - \mu_{x|O}| - 1.5\,\sigma_{x|O}$ might be negative, especially for large values of $\sigma_{x|O}$. This artifact happens at
points $x$ that are "far" from previously observed samples. To prevent or minimize this, we need to make sure that the
observed samples affect the mean and variance of all points in the space. For example, if we use the Gaussian kernel
$k(x_1, x_2) = \exp\left(-\frac{1}{l}\|x_1 - x_2\|_2^2\right)$ for exploration, then we need to choose $l$ large enough to make sure each observed
sample affects all the points in the space, e.g., $l > \sum_{i=1}^{d} (b_i - a_i)^2$. If we pick $l$ small, then the exploration algorithm
starts exploring around the previous samples and extends the explored area gradually to reach the other side of the
search space. This strategy is not optimal if we have limited samples for exploration.
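The effect of the kernel width can be checked numerically. The sketch below assumes a Gaussian kernel of the form exp(-||x1 - x2||^2 / width) and uses the squared diagonal of the box as the width threshold; both function names are hypothetical.

```python
import numpy as np

def min_exploration_width(lower, upper):
    """Squared diagonal of the box: sum_i (b_i - a_i)^2."""
    lower, upper = np.asarray(lower, float), np.asarray(upper, float)
    return float(((upper - lower) ** 2).sum())

def gaussian_kernel(x1, x2, width):
    """Kernel value between two single points."""
    diff = np.asarray(x1, float) - np.asarray(x2, float)
    return float(np.exp(-np.sum(diff ** 2) / width))
```

On the unit square the threshold is 2. With that width, two opposite corners still have kernel value exp(-1), roughly 0.37, so a sample in one corner influences the posterior in the other. With a width of 0.1 the same kernel value is exp(-20), which is effectively zero.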
To implement NBRS, we need to maximize the volume

$$g(x) = \mathrm{Vol}\left(D_t \cap S\left(x, \frac{|M - \mu_{x|O}| - 1.5\,\sigma_{x|O}}{L}\right)\right),$$

where $D_t$ represents the current unexplored input space. To evaluate $g(x)$, we take a large number of points $N$ inside
the sphere $S\left(x, \frac{|M - \mu_{x|O}| - 1.5\,\sigma_{x|O}}{L}\right)$ uniformly at random. Then, for each point, we check if it crosses the borders $[a_i, b_i]$
or falls into the spheres of previously observed samples. If not, we count that point as a newly explored point. Finally,
if there are $n$ newly explored points, then we set $g(x) \approx \frac{n}{N} \left(\frac{|M - \mu_{x|O}| - 1.5\,\sigma_{x|O}}{L}\right)^{d}$.
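The sampling procedure for estimating the newly explored volume can be sketched as below. This is an illustration under stated assumptions rather than the authors' code: points are drawn uniformly in the ball by normalizing Gaussian directions, and the estimate is the accepted fraction times radius^d, which omits the constant unit-ball volume (the same for every candidate, so it does not change the maximizer). All names are hypothetical.

```python
import numpy as np

def sample_in_ball(center, radius, n, rng):
    """Draw n points uniformly from the d-dimensional ball S(center, radius)."""
    d = len(center)
    direction = rng.normal(size=(n, d))
    direction /= np.linalg.norm(direction, axis=1, keepdims=True)
    scale = radius * rng.random(n) ** (1.0 / d)
    return center + direction * scale[:, None]

def explored_volume(center, radius, lower, upper, X_obs, radii, n=20000, rng=None):
    """Estimate g(x) as (fraction of newly explored points) * radius**d."""
    rng = np.random.default_rng() if rng is None else rng
    if radius <= 0:
        return 0.0
    pts = sample_in_ball(np.asarray(center, float), radius, n, rng)
    # A point counts only if it stays inside the box D ...
    inside_box = np.all((pts >= lower) & (pts <= upper), axis=1)
    # ... and outside every sphere of a previously observed sample.
    if len(X_obs):
        dist = np.linalg.norm(pts[:, None, :] - X_obs[None, :, :], axis=-1)
        unexplored = ~(dist < radii[None, :]).any(axis=1)
    else:
        unexplored = np.ones(n, dtype=bool)
    return float((inside_box & unexplored).mean() * radius ** len(center))
```

A ball centered at a corner of the unit square keeps only about a quarter of its points, which is the boundary effect the text describes.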

To optimize $g(x)$, one can use deterministic, derivative-free optimizers like DIRECT. The problem is that
DIRECT only optimizes Lipschitz continuous functions; however, $g(x)$ is not necessarily Lipschitz continuous. In our
implementation, we take a large number of points inside $D_t$, evaluate $g(\cdot)$ at those points and pick the maximum.
This method might be slower than DIRECT, but it avoids the inaccurate results of DIRECT, especially when $D_t$ describes a
small region.
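The candidate-based maximization described above amounts to a random search. A minimal sketch, assuming candidates are drawn uniformly from the bounding box (restricting them to the unexplored region can be folded into `score`, for instance by returning minus infinity for eliminated points); the function name and the `n_candidates` default are hypothetical.

```python
import numpy as np

def maximize_by_random_search(score, lower, upper, n_candidates=2000, rng=None):
    """Evaluate score at uniformly drawn candidates and return the best one."""
    rng = np.random.default_rng() if rng is None else rng
    lower, upper = np.asarray(lower, float), np.asarray(upper, float)
    cand = lower + (upper - lower) * rng.random((n_candidates, len(lower)))
    values = np.array([score(c) for c in cand])
    best = int(np.argmax(values))
    return cand[best], float(values[best])
```

The cost grows linearly with the number of candidates, which is the price paid for not relying on smoothness of the scored function.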



3.2 Exploitation 

In the exploitation phase of the algorithm, we would like to use the information gained in the exploration phase to find
the optimal point of $f(\cdot)$. Suppose we have explored the search space with $t$ samples and we want to find $x^* \in D_t$. In
order to exploit, we would like to find points $x$ whose sphere is small. The reason is that if $r_x = \frac{M - f(x)}{L}$ is small
enough, then by local strong convexity of $f(\cdot)$ around $x^*$, for some constant $\kappa$ we have

$$\kappa\, \|x - x^*\|_2^2 \le M - f(x) \le L\, r_x.$$

Following the argument in Section 3.1, we estimate $r_x$ by its mean $\mathbb{E}[r_x] = \frac{|M - \mu_{x|O}|}{L}$. By the Hoeffding inequality, for all
$\epsilon > 0$, we have

$$\mathbb{P}\left[ r_x > \frac{|M - \mu_{x|O}| + \epsilon}{L} \right] \le \exp\left(-\frac{2\epsilon^2}{\sigma_{x|O}^2}\right).$$

Similarly, replacing $\epsilon$ with $1.5\,\sigma_{x|O}$, the above inequality entails that with high probability (99%), $r_x \le \frac{|M - \mu_{x|O}| + 1.5\,\sigma_{x|O}}{L}$.
Hence, a "good" algorithm for exploitation should try to find the point $x$ that minimizes the upper bound on $r_x$. This
choice of $x$ yields the point expected to be closest to $x^*$. We present the pseudo code of this method in Algorithm 2.

Algorithm 2 Next Best exploitive Sample (NBIS)

Input: Maximum $M$, Lipschitz constant $L$ and set of observed samples $\{(x_1, f(x_1)), \ldots, (x_q, f(x_q))\}$
Output: Next best exploitive sample $x$

$D_q \leftarrow D \setminus \bigcup_{i=1}^{q} S(x_i, r_{x_i})$
$x \leftarrow \arg\min_{x \in D_q} \mathrm{Vol}\left(S\left(x, \frac{|M - \mu_{x|O}| + 1.5\,\sigma_{x|O}}{L}\right)\right)$

The optimization in Algorithm 2 is nothing but minimizing

$$h(x) = \frac{|M - \mu_{x|O}| + 1.5\,\sigma_{x|O}}{L}.$$

To optimize $h(x)$, again we take a large number of points in $D_q$ (the current unexplored space) uniformly at random,
evaluate $h(\cdot)$ on those and pick the minimum.
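The exploitation step can be sketched as a minimization over candidate points. The sketch assumes the objective has the form (|M - mean| + 1.5 * std) / L, that the posterior mean and standard deviation at the candidates are already available, and that a boolean mask marks which candidates lie in the unexplored space; the function name and argument layout are hypothetical.

```python
import numpy as np

def nbis_select(candidates, mean, std, unexplored, M, L):
    """Return the unexplored candidate minimizing (|M - mean| + 1.5 * std) / L."""
    h = (np.abs(M - mean) + 1.5 * std) / L
    h = np.where(unexplored, h, np.inf)      # never pick an eliminated point
    best = int(np.argmin(h))
    return candidates[best], float(h[best])
```

The objective favors points whose predicted value is close to M and whose prediction is certain, which is why a low-variance candidate near the maximum beats a high-variance one with a similar mean. If every candidate is eliminated, the returned value is infinite and the caller should draw new candidates.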



3.3 Exploration-Exploitation Trade-off 

The main algorithm consists of an initial exploration phase followed by exploitation. Notice that we are using the GP
as an estimate of the unknown function and our method, like EI, relies heavily on the quality of this estimate. At
a high level, if the function is very complex, i.e., has a large Lipschitz constant $L$, then we need more exploration to
obtain a good GP fit. Small values of $L$ correspond to flatter functions that are easier to optimize. Thus, in general, we
expect the number of exploration steps to scale up with $L$. As a rule of thumb, the functions we normally deal with satisfy
$2 < L < 20$, for which we spend 20% of our budget on exploration and the rest on exploitation.
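The two-phase budget split can be written as a simple schedule. This is a minimal sketch: the 20% default follows the rule of thumb above, while rounding to the nearest integer is an assumption of the sketch, and the function name is hypothetical.

```python
def two_phase_schedule(n_b, explore_fraction=0.2):
    """Yield 'explore' for the first share of the budget and 'exploit' afterwards."""
    n_explore = round(explore_fraction * n_b)   # rounding rule is an assumption
    for t in range(n_b):
        yield "explore" if t < n_explore else "exploit"
```

With a budget of 15 this gives 3 exploration steps followed by 12 exploitation steps, and with a budget of 35 it gives 7 followed by 28.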

We use different kernel widths for the exploration and exploitation phases. In the case of exploration for complex 
functions, if we have enough budget (and hence, enough explorative samples), the kernel width can be set to a small 
value to fit a better local GP model. However, if we do not have enough budget, we need to take the kernel width to be 
large. In the case of exploitation, we pick the kernel width under which EI achieves its best performance. 

Note that the choice of $M$ and $L$ plays a crucial role in this algorithm. If we pick $L$ larger than the true Lipschitz
constant, then the radii of our spheres shrink and hence we might need more budget to achieve a certain performance.
Choosing $L$ smaller than the true Lipschitz constant is dangerous since it makes the spheres large and increases the chance of
including the optimal point in a sphere and hence removing it. Thus, it is better to choose $L$ slightly larger than our
estimate of the true Lipschitz constant to be on the safe side.

The method is less sensitive to the choice of $M$, since the derivative of the radius with respect to $M$ is proportional
to $\frac{1}{L}$. Thus, as long as we do not overestimate $M$ significantly, the factor $\frac{1}{L}$ prevents the spheres from becoming very large
(and including/removing the optimal point). Small values of $M$ make the spheres smaller and hence, if we underestimate
$M$, we would need more budget to achieve a certain performance. However, if $M$ is significantly (proportional to $L$)
smaller than the true maximum of the function, then the algorithm will look for the point that achieves $M$ and hence
will perform poorly.
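The sensitivity argument can be made explicit with a short worked derivation, assuming the radius has the form $r_x = (M - f(x))/L$:

```latex
r_x = \frac{M - f(x)}{L}, \qquad
\frac{\partial r_x}{\partial M} = \frac{1}{L}, \qquad
\frac{\partial r_x}{\partial L} = -\frac{M - f(x)}{L^2} = -\frac{r_x}{L}.
```

As an illustrative calculation, with $L = 3$ an overestimate of $M$ by $0.3$ enlarges every radius by $0.1$, whereas halving $L$ doubles every radius.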
Figure 2: The contour plots for the four proposed 2-dimensional benchmarks.



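Table 1: Benchmark Functions

Function          Mathematical expression
Cosines(2)        $1 - (u^2 + v^2 - 0.3\cos(3\pi u) - 0.3\cos(3\pi v))$, with $u = 1.6x - 0.5$, $v = 1.6y - 0.5$
Rosenbrock(2)     $10 - 100(y - x^2)^2 - (1 - x)^2$
Hartman(3,6)      $\sum_{i=1}^{4} \alpha_i \exp\left(-\sum_{j=1}^{d} A_{ij}(x_j - P_{ij})^2\right)$; $\alpha_{1 \times 4}$, $A_{4 \times d}$, $P_{4 \times d}$ are constants
Michalewicz(5)    $-\sum_{i=1}^{5} \sin(x_i)\, \sin\left(\frac{i\, x_i^2}{\pi}\right)^{20}$
Shekel(4)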
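The two 2-dimensional synthetic benchmarks can be written directly from their expressions in Table 1. This sketch implements the raw, un-normalized forms; the normalization used in the experiments is not reproduced here, and the function names are hypothetical.

```python
import math

def cosines(x, y):
    """Cosines benchmark on [0, 1]^2, raw (un-normalized) form."""
    u, v = 1.6 * x - 0.5, 1.6 * y - 0.5
    return 1 - (u**2 + v**2 - 0.3 * math.cos(3 * math.pi * u) - 0.3 * math.cos(3 * math.pi * v))

def rosenbrock(x, y):
    """Rosenbrock benchmark on [0, 1]^2, written as a maximization problem."""
    return 10 - 100 * (y - x**2) ** 2 - (1 - x) ** 2
```

In this form, Rosenbrock attains its maximum of 10 at (1, 1), and Cosines equals 1.6 at x = y = 0.3125, where u = v = 0.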
4 Experimental Results

In this section, we compare our algorithm with EI under different scenarios for different functions. We consider six
well-known synthetic benchmark functions:

(1,2) Cosines and Rosenbrock over $[0, 1]^2$

(3,4) Hartman(3,6) over $[0, 1]^{3,6}$

(5) Shekel over $[3, 6]^4$

(6) Michalewicz over $[0, \pi]^5$

The mathematical expressions of these functions are shown in Table 1. Moreover, we use two benchmarks derived from
real-world applications:

(1) Hydrogen [6] over $[0, 1]^2$

(2) Fuel Cell [2] over $[0, 1]^2$

The contour plots of these two benchmarks along with the Cosines and Rosenbrock benchmarks are shown in Fig. 2.
The Fuel Cell benchmark is based on optimizing the electricity output of a microbial fuel cell by modifying some
nano-structure properties of the anodes. In particular, the inputs that we try to adjust are the average area and average
circularity of the nanotubes, and the output that we try to maximize is the power output of the fuel cell. We fit a
regression model on a set of observed samples to simulate the underlying function $f(\cdot)$ for evaluation. The Hydrogen
benchmark is based on maximizing the hydrogen production of a particular bacterium by varying the pH and nitrogen
levels of its growth medium. A GP is fitted to a set of observed samples to simulate the underlying function $f(\cdot)$. We
consider a Lipschitz constant $L \approx 3$ for all of the benchmarks, except for Cosines and Michalewicz with $L \approx 6$ and
Rosenbrock with $L \approx 45$. For the sake of comparison, we consider the normalized versions of all these functions
and hence $M = 1$ in all cases. As mentioned previously, we spend 20% of the budget on exploration and 80% on
exploitation.
Function     EI             NBRS+EI        NBRS+NBIS
Cosines      .0736 ± .016   .1057 ± .029   .0270 ± .009
Fuel Cell    .1366 ± .006   .1357 ± .004   .0965 ± .004
Hydrogen     .0902 ± .004   .1149 ± .004   .0475 ± .006
Rosen        .0134 ± .001   .0163 ± .001   .0034 ± .000
Hart(3)      .0618 ± .006   .0450 ± .003   .0384 ± .003
Shekel       .3102 ± .017   .3011 ± .018   .3240 ± .030
Michal       .5173 ± .010   .5011 ± .010   .4554 ± .019
Hart(6)      .1212 ± .002   .1235 ± .002   .1020 ± .003

Table 2: Comparison of the best results of EI, NBRS+EI and NBRS+NBIS. This result shows that our algorithm
outperforms the other two counterparts significantly in most cases both in terms of the mean and variance of the
performance.

4.1 Comparison to EI 

In the first set of experiments, we would like to compare our algorithm with the best possible performance of EI. For
each benchmark, we search over different values of the kernel width and find the one that optimizes EI's performance.
Fig. 1 is plotted using these optimal kernel widths and shows that the best performance of EI happens when we take
only one random sample from a given budget. This performance is then used as the baseline for comparison in Table 2.
In light of the results of Fig. 1, we are also interested in whether our exploration algorithm can be used to improve the
performance of EI. To this end, we replace the proposed exploitation algorithm with EI to examine if our exploration
strategy helps EI. We refer to this setting as NBRS+EI.

Table 2 summarizes the mean and variance of the performance, measured as the regret $M - \max f(x_O)$,
for different benchmarks, estimated over 1000 random runs. In all benchmarks, our algorithm
(NBRS+NBIS) outperforms EI, except for the Shekel benchmark, where EI and NBRS+EI perform slightly
better. We suspect this is due to the fact that we have not optimized our kernel widths, whereas the EI
kernel width is optimized.

We also note that NBRS+EI does not lead to any consistent improvement over EI. This is possibly due to the fact 
that EI does not take advantage of the reduced search space produced by NBRS during selection. 
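The regret measure used in this comparison can be computed as in the sketch below. It assumes regret is M minus the best observed function value of a run; the spread is reported here as a sample standard deviation, which is an assumption of the sketch and not necessarily the quantity shown after the ± sign in Table 2. The function names are hypothetical.

```python
import numpy as np

def regret(M, observed_values):
    """Regret of one run: gap between the maximum M and the best observed value."""
    return M - np.max(observed_values)

def summarize(regrets):
    """Mean and sample standard deviation of the regret over repeated runs."""
    r = np.asarray(regrets, float)
    return float(r.mean()), float(r.std(ddof=1))
```

For a normalized benchmark with M = 1, a run whose best observation is 0.9 has a regret of about 0.1.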

4.2 Exploration Analysis 

In the second set of experiments, we would like to compare our exploration algorithm NBRS with random exploration
when using NBIS for exploitation. As discussed previously, both random exploration and NBRS fail to produce better
performance when used with EI. Thus, it is interesting to see whether they can help NBIS in terms of the overall regret,
and if so, which one is more effective. Figure 3 summarizes this result for all benchmarks. For a fixed budget $n_b$, we
start with 1 explorative sample (either using NBRS or random) followed by $n_b - 1$ NBIS samples; next, we start with
2 explorative samples followed by $n_b - 2$ NBIS samples, and so on. In each case, we average the regret over 1000
runs. The black line corresponds to the NBRS exploration and the green line corresponds to the random exploration.
We will discuss each function in more detail later, but in general, this result shows that our exploration algorithm is
a) better than random exploration and b) necessary. To see why it is necessary, notice that the minimum regret on all
curves is achieved for a non-zero number of NBRS samples. This means that, unlike EI, our exploitation algorithm benefits
from NBRS.
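The sweep over exploration budgets can be enumerated as in the sketch below. It assumes the sweep runs from 1 explorative sample up to the full budget; the endpoint is an assumption of the sketch, and both function names are hypothetical.

```python
def exploration_sweep(n_b):
    """Yield (n_explore, n_exploit) pairs: 1 then n_b - 1, 2 then n_b - 2, and so on."""
    for k in range(1, n_b + 1):
        yield k, n_b - k

def best_split(avg_regret_by_k):
    """Number of explorative samples with the smallest average regret."""
    return min(avg_regret_by_k, key=avg_regret_by_k.get)
```

Each pair corresponds to one point on a curve of average regret versus number of explorative samples, and `best_split` picks the minimum of such a curve once the average regrets have been measured.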

Looking closer into the results, we see that NBRS almost always leads to a smaller regret than random
exploration. On the Shekel benchmark, we see that random exploration performs better if we spend the majority of
the budget on exploration. However, for a reasonable amount of exploration that leads to the minimum regret (5 to 10
experiments), random exploration and NBRS achieve similar performance.

On our 6-dimensional benchmark Hartman(6), we notice that random exploration and NBRS behave very similarly.
This suggests that the input space is so large that, no matter how cleverly you explore, you are unlikely to improve the
performance for the limited budget of 35.

NBRS starts from an initial point and explores the input space step by step. Imagine you are in a dark room with a
torch in your hand and you want to explore the room. You start from an initial point and little by little walk through the
space until you explore the whole space. This is exactly how NBRS does the exploration. Roughly speaking, NBRS
minimizes $\mu_{x|O} + 1.5\,\sigma_{x|O}$ and hence, if a point is far from previous observations, i.e., $\sigma_{x|O}$ is large, it is unlikely
to be chosen. We see this effect in all functions, but most clearly in the Michalewicz benchmark. When the number
of explorative samples is smaller than 10, the step-by-step exploration procedure cannot explore the whole space and the
exploitation can be trapped in local optima. For 10 to 15 explorative samples, NBRS can walk through the entire space
fairly well and hence we get a minimum regret. For more than 15 explorative samples, since the space is well explored,
we are wasting samples that could potentially be used to improve our exploitation and hence the performance
becomes worse.

Figure 3: Plot of regret versus the number of explorative samples for the NBIS algorithm. For a fixed budget $n_b$, we run a number
of experiments as follows: first we consider the case where there is 1 explorative sample (either random or NBRS)
followed by $n_b - 1$ NBIS samples, next we consider the case where there are 2 explorative samples followed by
$n_b - 2$ NBIS samples, and so on. For 2D and 3D functions, we let $n_b = 15$ and for high-dimensional functions, we let
$n_b = 35$. This result shows that in most cases, our exploration is a) better than random, and b) necessary, since the
regret achieves its minimum somewhere apart from zero. On average, we need to spend 20% of our budget on exploration; however,
this portion can be optimized if we consider any specific function.

Finally, this investigation suggests that the result in Table 2 can be further improved by taking a different number of
explorative samples for different functions. To minimize parameter tuning, we chose to spend 20% of our budget on exploration. In
general, this ratio can be adjusted according to the properties of the function (e.g., the Lipschitz constant).



5 Conclusion 

In this paper, we considered the problem of maximizing an unknown costly-to-evaluate function when we have a small
evaluation budget. Using the Bayesian optimization framework, we proposed a two-phase exploration-exploitation
algorithm that finds the maximizer of the function with few function evaluations by leveraging the Lipschitz property
of the unknown function. In the exploration phase, our algorithm tries to remove as many points as possible from the
search space and hence shrinks the search space. In the exploitation phase, the algorithm tries to find the point that is
closest to the optimum. Our empirical results show that our algorithm outperforms EI (even under its best configuration) on most benchmarks.